
Creates a new way to test multiple selections for each protein; parallelizes LDDT script #116

Merged
marcuscollins merged 6 commits into main from new-selection-lists
Mar 2, 2026

Conversation

@marcuscollins (Collaborator) commented Mar 1, 2026

Sorry, this is a pretty big one. All it should do is what's in the title though.

Summary by CodeRabbit

  • New Features

    • Parallelized LDDT evaluation across multiple selections per protein
    • New CLI options: --target-filename and --n-jobs
  • Configuration Changes

    • ProteinConfig.selection now supports multiple selections (parsed from CSV)
  • Improvements

    • Per-selection reference loading and coordinate aggregation for robust comparisons
    • Enhanced experiment filtering (case-insensitive) and per-selection error logging
  • Utilities

    • Altloc utilities support optional selection filtering and stronger validation

# Conflicts:
#	src/sampleworks/core/rewards/real_space_density.py

# Conflicts:
#	src/sampleworks/utils/atom_array_utils.py
Parallelizing LDDT script with joblib. Closes #98
@coderabbitai (Bot, Contributor) commented Mar 1, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51ee774 and 2945059.

📒 Files selected for processing (6)
  • scripts/eval/rscc_grid_search_script.py
  • src/sampleworks/eval/eval_dataclasses.py
  • src/sampleworks/eval/grid_search_eval_utils.py
  • src/sampleworks/eval/structure_utils.py
  • src/sampleworks/utils/atom_array_utils.py
  • src/sampleworks/utils/density_utils.py

📝 Walkthrough

Walkthrough

Parallelizes LDDT evaluation by splitting per-protein selections into independent jobs, expands reference-loading and selection handling to be per-selection, updates dataclasses and selection utilities to support multiple selections, and aggregates per-selection results into a unified dataframe.
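The dispatch pattern described above, one independent job per (experiment, selection) pair, can be sketched with joblib. The worker body, experiment ids, and selection strings below are illustrative placeholders, not the script's actual API:

```python
from itertools import product

from joblib import Parallel, delayed


def process_exp_with_selection(exp: str, selection: str) -> dict:
    # Placeholder worker: the real script loads the predicted structure,
    # applies the selection mask, and computes LDDT for this pair.
    return {"exp": exp, "selection": selection, "lddt": None}


experiments = ["exp_a", "exp_b"]        # hypothetical experiment ids
selections = ["chain A", "resi 10-20"]  # hypothetical selections

# One job per (experiment, selection) pair, run across worker processes.
results = Parallel(n_jobs=2)(
    delayed(process_exp_with_selection)(exp, sel)
    for exp, sel in product(experiments, selections)
)
print(len(results))  # one result row per pair
```

The per-selection result dicts can then be concatenated with any serial "null" results into a single dataframe, as the walkthrough describes.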

Changes

  • LDDT Parallelized Evaluation (scripts/eval/lddt_evaluation_script.py):
    Added process_exp_with_selection() and joblib Parallel usage to run per-(experiment, selection) LDDT evaluations; changed the main flow to preload nested reference caches keyed by (protein, occ, selection), filter experiments case-insensitively, and aggregate null and parallel results.
  • RSCC Grid Search (scripts/eval/rscc_grid_search_script.py):
    Switched from hard-coded OCCUPANCY_LEVELS to args.occupancies; discovery accepts target_filename; per-selection result dicts and the base_map cache now include selection; added robust per-result error capture.
  • Dataclass: Multi-selection (src/sampleworks/eval/eval_dataclasses.py):
    ProteinConfig.selection changed from str to list[str]; from_csv now splits selection on ";"; added validation to ensure non-empty selections.
  • Reference / Selection Handling (src/sampleworks/eval/structure_utils.py):
    get_reference_structure_coords() now returns dict[str, np.ndarray] keyed by selection; added masked selection support, ATOMWORKS_COMPARISON_OPS, per-selection loading/diagnostics, and a deprecation guard for old-style selection strings.
  • Altloc & AtomArray Utilities (src/sampleworks/utils/atom_array_utils.py):
    find_all_altloc_ids() gains an altloc_label param; map_altlocs_to_stack() gains selection and return_full_array params, adds input validation, and implements selection-aware masking/filtering before stacking.
  • CLI / Grid Search Utils (src/sampleworks/eval/grid_search_eval_utils.py):
    Added CLI options --target-filename (default refined.cif) and --n-jobs (default 16); updated occupancies help text and minor TODO/FIXME comments.
  • Minor Edits / Comments (src/sampleworks/core/rewards/real_space_density.py, src/sampleworks/utils/density_utils.py):
    Whitespace and comment tweaks and a TODO insertion; no behavioral changes.
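The two new CLI options follow a standard argparse pattern. This is a sketch matching the defaults summarized above, not the exact code in grid_search_eval_utils.py:

```python
import argparse

# Sketch of the two new options; defaults mirror the PR summary.
parser = argparse.ArgumentParser(description="grid-search eval options (sketch)")
parser.add_argument(
    "--target-filename",
    default="refined.cif",
    help="filename to match when discovering refined structures",
)
parser.add_argument(
    "--n-jobs",
    type=int,
    default=16,
    help="number of parallel joblib workers",
)

args = parser.parse_args([])  # empty list: parse just the defaults
print(args.target_filename, args.n_jobs)
```

Note that argparse maps `--target-filename` to the attribute `args.target_filename` automatically (dashes become underscores).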

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / Main Script
    participant Filter as Experiment Filter
    participant RefLoader as Reference Loader
    participant Parallel as joblib.Parallel
    participant Worker as process_exp_with_selection
    participant Aggregator as Result Aggregator

    CLI->>Filter: load experiments & apply case-insensitive protein matching
    Filter->>RefLoader: request reference coords per (protein, occ, selection)
    RefLoader-->>Filter: return nested dict [(protein,occ)][selection]->coords
    CLI->>Parallel: dispatch (exp, selection) tasks
    Parallel->>Worker: for each (exp,selection) -> load predicted structure
    Worker->>Worker: apply selection mask / clustering / compute LDDT
    Worker-->>Aggregator: return per-selection result dict (metrics or error)
    Aggregator->>Aggregator: combine results + null_results -> dataframe
    Aggregator-->>CLI: write CSV / return dataframe

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • k-chrispens

Poem

🐰 Hop-hop — selections split and go,
Each atom masked in a parallel row.
Nested dicts hum, jobs dance in tune,
Results aggregate beneath the moon.
A rabbit cheers: reviews come soon! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 77.78%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title directly matches the main changes: multiple selections per protein support and LDDT script parallelization via joblib.




@coderabbitai (Bot, Contributor) left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
scripts/eval/lddt_evaluation_script.py (1)

223-241: Load each reference structure once per occupancy, then fan out by selection.

The current product loop reloads and reparses the same reference structure for every selection, which is avoidable overhead.

⚡ Proposed refactor
-    for protein_key, protein_config in protein_configs.items():
-        for occ, sel in itertools.product(args.occupancies, protein_config.selection):
-            ref_path, reference_proteins = get_reference_atomarraystack(protein_config, occ)
+    for protein_key, protein_config in protein_configs.items():
+        for occ in args.occupancies:
+            ref_path, reference_proteins = get_reference_atomarraystack(protein_config, occ)
             if reference_proteins is None:
                 logger.warning(
                     f"Could not find ref structure for {protein_key} and occupancy {occ}"
                 )
                 continue
+            if (protein_key, occ) not in reference_atom_arrays:
+                reference_atom_arrays[(protein_key, occ)] = {}
+
+            for sel in protein_config.selection:
+                try:
+                    reference_protein_stack, _, _ = map_altlocs_to_stack(
+                        reference_proteins,
+                        selection=translate_selection(sel),
+                        return_full_array=True,
+                    )
+                    reference_atom_arrays[(protein_key, occ)][sel] = reference_protein_stack
+                except Exception as e:
+                    logger.error(
+                        f"Error loading ref structure for {protein_key} and occupancy {occ}: {e}"
+                    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lddt_evaluation_script.py` around lines 223 - 241, The loop
currently calls get_reference_atomarraystack and reparses the same reference for
every selection because itertools.product over args.occupancies and
protein_config.selection repeats loads; change the logic to iterate over
occupancies first (for occ in args.occupancies), call
get_reference_atomarraystack(protein_config, occ) once, skip if None, then
inside that occupancy iterate over selections (for sel in
protein_config.selection) and call map_altlocs_to_stack(reference_proteins,
selection=translate_selection(sel), return_full_array=True) and store into
reference_atom_arrays[(protein_key, occ)][sel]; ensure the existing caching
check for (protein_key, occ) remains so each reference is only stored once per
occupancy.
scripts/eval/rscc_grid_search_script.py (1)

127-165: Avoid recomputing refined density once per selection.

Parsing CIF and recomputing density inside the selection loop scales linearly with number of selections and can dominate runtime. Compute predicted density once per experiment, then only run per-selection extraction/RSCC.

⚡ Refactor sketch
-        for selection in protein_config.selection:
+        # Load refined structure and compute predicted density once per experiment
+        _structure = parse(_exp.refined_cif_path, ccd_mirror_path=None)
+        atom_array = get_asym_unit_from_structure(_structure)
+        if not hasattr(atom_array, "coord") or atom_array.coord is None:
+            raise AttributeError("AtomArray | AtomArrayStack is missing coordinates")
+        if not hasattr(atom_array, "b_factor"):
+            atom_array.set_annotation("b_factor", np.full(atom_array.coord.shape[-2], 20.0))
+
+        for selection in protein_config.selection:
             ...
-                _structure = parse(_exp.refined_cif_path, ccd_mirror_path=None)
-                atom_array = get_asym_unit_from_structure(_structure)
-                ...
                 _computed_density, _ = compute_density_from_atomarray(
                     atom_array, xmap=_base_xmap, em_mode=False, device=_device
                 )

You can also split cache keys so full base maps are cached by (protein, occ_a) and extracted submaps by (protein, occ_a, selection).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/rscc_grid_search_script.py` around lines 127 - 165, The code
recomputes parsing and computed density inside the per-selection loop (parse,
get_asym_unit_from_structure, compute_density_from_atomarray), causing
O(selections) work; move CIF parsing and density computation out of the
selection loop and cache the results per experiment/occ_a. Specifically: parse
the refined CIF via parse(...) once for each _exp and produce
atom_array/get_asym_unit_from_structure once, then call
compute_density_from_atomarray(...) once to produce a cached _computed_density
(keyed by (protein, _exp.occ_a) or a separate computed_density_cache). Inside
the selection loop keep only loading/_base_xmap extraction and per-selection
_extracted_base handling, and reuse the cached _computed_density instead of
calling parse or compute_density_from_atomarray again.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/rscc_grid_search_script.py`:
- Around line 180-193: The code mutates the shared Experiment object _exp across
selections causing state leakage; fix by working on a fresh shallow/deep copy
before any mutation — e.g., create a copy of _exp (use copy.deepcopy or convert
to dict first) and perform assignments to that copy (set rscc, base_map_path,
error) and then append the copy to results; ensure you stop mutating the
original _exp (referenced symbols: _exp, exp_dict_copy, selection, results,
rscc) so each selection row is independent.

In `@src/sampleworks/eval/eval_dataclasses.py`:
- Line 193: The CSV parsing currently uses
selection=row["selection"].strip().split(";") which can produce empty or
whitespace-only entries; update the parsing logic where the selection field is
built so you split on ";" then trim each item and filter out empty strings
(e.g., split, map strip, filter out falsy items) and then validate the resulting
list (e.g., ensure it's non-empty or that each item matches allowed choices)
before assigning to the selection field; adjust any error handling to raise a
clear exception or log invalid selections so downstream code that consumes
selection receives a normalized, validated list.
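A minimal version of the suggested parsing fix, split on ";", strip each item, drop empties, then validate, might look like the following. The helper name is illustrative, not the dataclass's actual method:

```python
def parse_selection_field(raw: str) -> list[str]:
    """Split a CSV 'selection' cell on ';' into a clean, validated list."""
    selections = [item.strip() for item in raw.split(";") if item.strip()]
    if not selections:
        # Fail loudly so downstream code never sees an empty selection list.
        raise ValueError(f"No valid selections in {raw!r}")
    return selections


# Trailing separators and stray whitespace are silently dropped.
print(parse_selection_field("chain A; resi 10-20 ;; "))
```

The filter step is what prevents `"chain A;"` from producing a spurious empty selection downstream.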

In `@src/sampleworks/eval/structure_utils.py`:
- Around line 369-373: The selection parsing in extract_selection_coordinates()
is inconsistent with apply_selection() and uses atom_array to compute the mask
(fragile for stacked inputs); update the condition to check for ("==", ">", "<",
"<=", ">=", " in ") and, in the else branch, compute mask from working_array
(use working_array.mask(selection) instead of atom_array.mask(selection)); keep
get_mask_from_old_selection_string(...) call for legacy parsing as-is and ensure
variable name mask is assigned from working_array when using the modern
selection path.

In `@src/sampleworks/utils/atom_array_utils.py`:
- Around line 295-303: After building altloc_ids and altloc_list via
find_all_altloc_ids and select_altloc, guard against cases where altloc_list has
fewer than two entries: if altloc_list is empty return an empty result (or the
function's appropriate zero-state), and if it contains exactly one element skip
calling filter_to_common_atoms and use that single selection as atom_arrays
(respecting return_full_array behavior); implement this check just before the
call to filter_to_common_atoms to avoid the downstream generic error.
- Around line 195-203: In find_all_altloc_ids(), hasattr(atom_array,
altloc_label) can be True even when the attribute value is None; change the
guard to fetch the attribute with getattr(atom_array, altloc_label, None),
validate that the returned altloc_ids is not None (and is the expected
array-like), and if it is None raise the same AttributeError (or a clear
ValueError) mentioning altloc_label and that the annotation is missing/None;
then proceed to call np.unique on altloc_ids and return set(altloc_ids.tolist())
- BLANK_ALTLOC_IDS.
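The suggested guard replaces hasattr with getattr so that a None-valued annotation is caught too. A standalone sketch, with the AtomArray stubbed by a plain object and an assumed value for BLANK_ALTLOC_IDS:

```python
import numpy as np

# Assumed sentinel values for "no altloc"; the real constant may differ.
BLANK_ALTLOC_IDS = {".", "?", ""}


def find_all_altloc_ids(atom_array, altloc_label: str = "altloc_id") -> set[str]:
    # getattr(..., None) catches both a missing attribute and a None value,
    # which hasattr alone would miss.
    altloc_ids = getattr(atom_array, altloc_label, None)
    if altloc_ids is None:
        raise AttributeError(
            f"Annotation {altloc_label!r} is missing or None on the input array"
        )
    return set(np.unique(altloc_ids).tolist()) - BLANK_ALTLOC_IDS


class FakeAtomArray:
    altloc_id = np.array(["A", "B", ".", "A"])


print(sorted(find_all_altloc_ids(FakeAtomArray())))
```
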


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 14c38a2 and 51ee774.

📒 Files selected for processing (8)
  • scripts/eval/lddt_evaluation_script.py
  • scripts/eval/rscc_grid_search_script.py
  • src/sampleworks/core/rewards/real_space_density.py
  • src/sampleworks/eval/eval_dataclasses.py
  • src/sampleworks/eval/grid_search_eval_utils.py
  • src/sampleworks/eval/structure_utils.py
  • src/sampleworks/utils/atom_array_utils.py
  • src/sampleworks/utils/density_utils.py

@k-chrispens (Collaborator) left a comment


A few nits but won't hold up over them!

# this should be a temporary measure only until we switch to atomworks style in the RSCC script

logger.warning(
if any(x in selection for x in ("==", ">", "<", "<=", ">=", " in ")):
@k-chrispens (Collaborator):
I think this is probably sufficient, but I am not sure. Maybe it would be better to detect the PyMOL-like selection syntax? We should also write (or use, if one exists) an atomworks/pandas <-> PyMOL style translator so that it is easy to port directly into PyMOL scripts.

@marcuscollins (Collaborator, Author):
I'll make an issue to solve this. I have a translation method but it only handles some cases. We definitely need a more complete solution.
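As a rough illustration of the kind of translator discussed in this thread, here is a regex-based sketch mapping a small subset of PyMOL-style selections to comparison-style strings. The input patterns, output field names (chain_id, res_id), and output syntax are all assumptions, since the actual translation method is not shown in this PR:

```python
import re


def translate_selection(pymol_sel: str) -> str:
    """Translate a tiny subset of PyMOL selection syntax (sketch only)."""
    sel = pymol_sel.strip()
    # "chain A" -> equality comparison on an assumed chain_id field
    m = re.fullmatch(r"chain (\w+)", sel)
    if m:
        return f"chain_id == '{m.group(1)}'"
    # "resi 10-20" -> range comparison on an assumed res_id field
    m = re.fullmatch(r"resi (\d+)-(\d+)", sel)
    if m:
        return f"res_id >= {m.group(1)} and res_id <= {m.group(2)}"
    raise ValueError(f"Unsupported selection: {pymol_sel!r}")


print(translate_selection("chain A"))
print(translate_selection("resi 10-20"))
```

A complete solution would need a real grammar for PyMOL's operator set (and/or/not, within, etc.) rather than a pattern lookup like this.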



def process_exp_with_selection(
exp: Experiment,
@k-chrispens (Collaborator):
I'm curious - how should we name these given that we will be interacting with the diffuse hub schema? should we align how we describe experiments for ourselves both here and there?

@marcuscollins (Collaborator, Author):
It's a good question. Personally I think the way to do this is to push all the metadata into the CIF files, and then we can do away with this kind of parsing.

Oh, but you're asking whether we should rename the "Experiment" class here? Yes, it would make sense to rename it.

----------
exp: Experiment, a description of the structure generation experiment
protein_config: ProteinConfig, specifying the locations of reference structures and maps
px_seln_refernce_atom_array: AtomArrayStack,
@k-chrispens (Collaborator):
typo here that would be worth not propagating I think: px_seln_refernce_atom_array -> px_seln_reference_atom_array

@marcuscollins (Collaborator, Author):
Good catch!

f"({n_valid}/{n_total} atoms remaining)"
)

# TODO can we use [..., valid_mask] here?
@k-chrispens (Collaborator):
I don't see why not, though I do think that coords is only going to be 2 dimensional here by requirement of AtomArrayStack or AtomArray
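The ellipsis pattern asked about in the TODO does work on both 2-D (AtomArray-like) and 3-D (AtomArrayStack-like) coordinate arrays, because `...` absorbs any leading stack dimension; note the trailing `:` is needed so the mask lands on the atom axis rather than the final xyz axis. A quick numpy check (shapes here are illustrative):

```python
import numpy as np

valid_mask = np.array([True, False, True])

coords_2d = np.arange(9).reshape(3, 3)      # (n_atoms, 3), AtomArray-like
coords_3d = np.arange(18).reshape(2, 3, 3)  # (n_models, n_atoms, 3), stack-like

# `...` stands in for zero or more leading axes, so the same expression
# masks the atom axis in both cases; the trailing `:` keeps the xyz axis whole.
print(coords_2d[..., valid_mask, :].shape)
print(coords_3d[..., valid_mask, :].shape)
```
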

…tom selection capabilities for eval scripts)
